Drake: a powerful tool for automated, reproducible workflows

Reproducible Workflow
A legacy R workflow note showing how drake can cache calculations and speed up R Markdown report rendering, with a note on modern targets workflows.
Author: Yang Liu

Published: September 15, 2019

drake is a powerful tool for reproducible R workflows. I found it especially useful when paired with R Markdown reports, because it caches expensive intermediate results and rebuilds only the targets whose code or upstream dependencies have changed.

Maintenance note: this is a legacy drake example. For new R workflow projects I would usually look at targets, which follows the same core idea with a newer interface. I am keeping the original drake code here because the example is still useful for understanding target-based workflows.
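For readers who want a feel for the targets interface, here is a minimal sketch on a toy two-target pipeline rather than the SHAP example. The target names (squares, total) are made up for illustration, and the pipeline runs in a temporary directory so it does not touch a real project.

```r
library(targets)

# Run the toy pipeline in a throwaway directory (hypothetical location).
pipeline_dir <- tempfile("targets_demo_")
dir.create(pipeline_dir)
old_wd <- setwd(pipeline_dir)

# tar_script() writes a _targets.R file containing the pipeline definition.
tar_script({
  list(
    tar_target(squares, (1:5)^2),    # stand-in for an expensive step
    tar_target(total, sum(squares))  # depends on `squares`
  )
})

tar_make()                       # builds both targets and stores them in _targets/
total_value <- tar_read(total)   # read a finished target back from the store

setwd(old_wd)
```

In a real project the list would live in a _targets.R file at the project root, and the report step would typically use tar_render() from the tarchetypes package where the drake plan uses rmarkdown::render() with knitr_in().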

Using SHAPforxgboost as an example:

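# install or update the packages if needed
# (packageVersion() errors when a package is missing, so check for that first)
if (!requireNamespace("drake", quietly = TRUE) || packageVersion("drake") < "7.4") install.packages("drake")
if (!requireNamespace("SHAPforxgboost", quietly = TRUE) || packageVersion("SHAPforxgboost") < "0.0.3") install.packages("SHAPforxgboost")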

suppressPackageStartupMessages({
  library("drake")
  library("SHAPforxgboost")
  library("here")
})

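# assign a place to store intermediate objects
cache_path <- here("Drake_Cache")
if (!dir.exists(cache_path)) dir.create(cache_path)
cache <- drake_cache(path = cache_path)
# drake_cache() may return NULL when the folder holds no cache yet; create one in that case
if (is.null(cache)) cache <- new_cache(path = cache_path)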

A drake_plan() calls user-defined functions to build each target. These functions are usually kept in a separate script and sourced before the plan is defined.

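get.xgb.mod <- function(dataX){
  # dataXY_df is the example dataset shipped with SHAPforxgboost
  y_var <- "diffcwv"
  # hyperparameter tuning results
  param_dart <- list(objective = "reg:squarederror",  # regression; named "reg:linear" in older xgboost releases
                     eta = 0.018,
                     max_depth = 10,
                     gamma = 0.009,
                     subsample = 0.98,
                     colsample_bytree = 0.86)
  n_rounds <- 366  # kept out of the parameter list; passed as its own argument

  mod <- xgboost::xgboost(data = as.matrix(dataX),
                          label = as.matrix(dataXY_df[[y_var]]),
                          params = param_dart,  # the argument is `params`, not `xgb_param`
                          nrounds = n_rounds,
                          verbose = FALSE,
                          # leave two cores free, but never ask for fewer than one thread
                          nthread = max(1, parallel::detectCores() - 2),
                          early_stopping_rounds = 8)
  return(mod)
}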

# ...
# define more functions if needed
# ...

The plan lists every target and ends by rendering all the results into the final R Markdown report. The great advantage is that every figure is built and cached as its own target before the report is rendered, so if you modify a figure, only that figure needs to be rerun.

my_plan <- drake_plan(
  dataX = data.table::copy(dataXY_df[,-"diffcwv"]),
  xgb_mod = get.xgb.mod(dataX),
  shap_long = shap.prep(xgb_model = xgb_mod, X_train = dataX, top_n = 4),
  # make a diluted (faster) summary plot showing only top 4 variables:
  fig1 = shap.plot.summary(shap_long, dilute = 10),
  fig2 = shap.plot.dependence(data_long = shap_long, x = 'dayint', y = 'dayint', color_feature = 'Column_WV'),
  fig3 = shap.plot.dependence(data_long = shap_long, x = 'dayint', y = 'Column_WV', color_feature = 'Column_WV'),
  
  report = rmarkdown::render(
    knitr_in("Code/drake_md_report.Rmd"),
    output_format = rmarkdown::html_document(toc = TRUE))
)

# build the configuration and show the dependency graph
nemia_config <- drake_config(my_plan, cache = cache)
# vis_drake_graph(nemia_config, from = names(nemia_config$layout))
vis_drake_graph(nemia_config)

# run the plan
make(my_plan, cache = cache)

Note that it is not a good idea to run drake inside an R Markdown file. A drake workflow is usually an R script that uses R Markdown only as the reporting layer.
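As a rough illustration of that split, a report chunk would typically just pull finished targets out of the cache with readd() or loadd() rather than recompute anything; knitr_in() scans the .Rmd file for those calls to work out which targets the report depends on. The sketch below uses a throwaway cache and hypothetical targets (fit, fig) instead of the SHAP objects, and runs the retrieval calls in a plain script to stay self-contained.

```r
library(drake)

# Throwaway cache so the sketch does not touch a real project (hypothetical setup).
toy_cache <- new_cache(path = tempfile("toy_cache_"))

toy_plan <- drake_plan(
  fit = cumsum(1:10),   # stand-in for a fitted model
  fig = summary(fit)    # stand-in for a figure object
)
make(toy_plan, cache = toy_cache)

# What a report chunk would do: fetch stored results instead of rebuilding them.
fig_value <- readd(fig, cache = toy_cache)  # returns the stored object
loadd(fit, cache = toy_cache)               # assigns `fit` in the calling environment
```

With a non-default cache location such as the one used in this post, the calls inside the report need to point at that cache as well; how that is wired up is left open here.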

Here is what the dependency graph looks like:

If we add an extra figure, only that figure (fig3, shown in black because it is outdated) needs to be built:
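The same behaviour can be reproduced on a toy plan. This is a minimal sketch, assuming a made-up slow_fit() function and hypothetical targets (fit, fig1, fig2) in a temporary cache; the sleep stands in for model training.

```r
library(drake)

# Throwaway cache so the sketch does not touch a real project (hypothetical setup).
toy_cache <- new_cache(path = tempfile("toy_cache_"))

slow_fit <- function(n) {
  Sys.sleep(2)  # stand-in for an expensive computation
  seq_len(n)
}

plan_v1 <- drake_plan(
  fit  = slow_fit(10),
  fig1 = range(fit)
)
# First run: every target is built, including the slow one.
time_v1 <- system.time(make(plan_v1, cache = toy_cache))[["elapsed"]]

plan_v2 <- drake_plan(
  fit  = slow_fit(10),
  fig1 = range(fit),
  fig2 = mean(fit)  # the newly added "figure"
)
# Second run: fit and fig1 are unchanged and already cached, so only fig2 is built.
time_v2 <- system.time(make(plan_v2, cache = toy_cache))[["elapsed"]]

fig2_value <- readd(fig2, cache = toy_cache)
```

Comparing time_v1 and time_v2 gives a rough sense of the saving: the second call does not pay for the sleep in slow_fit() again.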

Here is what the R Markdown file looks like on GitHub:

The drake plan then generates the HTML report automatically (drake_md_report.html), which looks like this: